SYSTEM, METHOD AND APPARATUS FOR
PROVIDING LINEARLY SCALABLE
DYNAMIC MEMORY MANAGEMENT IN A
MULTIPROCESSING SYSTEM

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to a computer system using 
intelligent input-output, and more particularly, to a system 
and method for providing linearly scalable dynamic memory 
management in a multiprocessing system. 

2. Description of Related Art 

A conventional computer system typically includes one or
more central processing units (CPUs) capable of executing 
various sequential sets of instructions, known as threads. 
Originally, a computer system included a single CPU 
capable of performing a single thread at a given time. 
Advances in operating systems have provided a technique 
for sharing a single CPU among multiple threads, known as 
multitasking. The development of multiprocessing brought 
computer systems with multiple CPUs, each executing a 
different thread at the same time. 

There are many variations on the basic theme of multi- 
processing. In general, the differences are related to how 
independently the various processors operate and how the 
workload among these processors is distributed. In loosely- 
coupled multiprocessing, the processors execute related 
threads, but they do so as if they were stand-alone proces-
sors. Each processor may have its own memory and may
even have its own mass storage. Further, each processor 
typically runs its own copy of an operating system, and 
communicates with the other processor or processors 
through a message-passing scheme, much like devices com- 
municating over a local-area network. Loosely-coupled mul- 
tiprocessing has been widely used in mainframes and 
minicomputers, but the software to do it is very closely tied 
to the hardware design. For this reason, it has not gained the 
support of software vendors, and is not widely used in PC 
servers. 

In tightly-coupled multiprocessing, by contrast, the opera-
tions of the processors are more closely integrated. They
typically share memory, and may even have a shared cache. 
The processors may not be identical to each other, and may 
or may not execute similar threads. However, they typically 
share other system resources such as mass storage and 
input/output (I/O). Moreover, instead of a separate copy of
the operating system for each processor, they typically run 
a single copy, with the operating system handling the 
coordination of threads between the processors. The sharing 
of system resources makes tightly-coupled multiprocessing 
less expensive, and it is the dominant multiprocessor archi- 
tecture in network servers. 

Hardware architectures for tightly-coupled multiprocess-
ing systems can be further divided into two broad categories.
In symmetrical multiprocessor (SMP) systems, system resources
such as memory and disk input/output are shared by all the
microprocessors in the system. The workload is distributed
evenly to available processors so that one does not sit idle 
while another is loaded with a specific thread. The perfor- 
mance of SMP systems generally increases for all threads as 
more processor units are added. 

An important goal in the design of multiprocessing sys-
tems is linear scalability. In a completely linearly scalable
system, the performance of the system increases linearly
with the addition of each CPU. The performance of the
system is measured in the number of instructions that the
system as a whole completes in a given time. However, in
most multiprocessing systems, as the number of CPUs is
increased, the performance gain realized by adding an
additional CPU decreases and becomes negligible.

A common problem with multiprocessing occurs when
more than one thread attempts to read or write to a common 
or shared memory. Those skilled in the art will recognize the 
data corruption that would occur if one thread were to read 
a set of memory locations while another thread were to write 
to the same set of memory locations. Common memory 
locations that are frequently accessed by various threads are 
the heap data structure and the free list. A heap is a portion 
of memory that is divided into smaller partitions. Each
partition is allocatable on demand to store data for the need
of particular threads. Once the data stored in the partition is
no longer needed by the thread, the partition is returned to
the heap. The heap data structure and the free list keep track
of which partitions are allocated to the various threads, and
which partitions are unallocated. When a thread is in need of
memory, the heap data structure and free list are accessed to
assign an unallocated partition of the heap to the thread.
When the thread is no longer in need of the partition of
memory, the partition of memory is returned to the heap. The
heap data structure and free list are updated to reflect that the
partition of memory is now unallocated.

The management of concurrent threads is performed by 
the operating system of the computer system which allocates 
various resources among various threads. The threads 
accessing the heap data structure and free list are synchro- 
nized by the operating system. In order to access the heap 
data structure and free list, a thread makes a call into the 
operating system. The actual access is performed at the 
operating system level. Consequently, by accessing heap 
data structure and free list at the operating system level, the 
accesses by each thread can be synchronized to prevent 
more than one thread from accessing the heap data structure 
and free list at the same time.

The operating system prevents simultaneous access to the 
heap data structure and free list by using spinlocks and 
interrupt masks. While accessing the heap data structure and 
free list through calls to the operating system prevents 
simultaneous access by the various threads, there are a 
number of associated drawbacks. The use of spinlocks and 
interrupt masking requires threads to wait while another
thread is accessing the heap data structure or free list.
Requiring threads to wait while another thread is accessing 
the heap data structure or free list substantially curtails the 
benefits of concurrent thread execution. As more CPUs are 
added, a bottleneck could potentially be created as each 
thread awaits access to the heap data structure and free list.

Another problem occurs because of the transition from the 
thread to the operating system. Normally, while a thread is
being performed, the instructions of the thread are being
executed in what is known as the application mode. When the
thread makes a call to the operating system to access the heap
data structure or free list, the access is performed at the
operating system level, known as the kernel mode. Changing
execution modes causes substantial time delays.

SUMMARY OF THE INVENTION 

The present invention is directed to a system and method 
for dynamically managing memory in a computer system by 
executing an instruction within an application program
causing the application program to access a heap data
structure and a free list containing the addresses of unallo-
cated regions of memory, determining the address of an
appropriately sized region of memory, and allocating the 
region of memory to the application program. 

The present invention is also directed to a method for 
dynamically deallocating memory in a computer system by 
causing an application program to place the address of a 
region of memory in a free list, and modifying an entry in 
the heap data structure.

BRIEF DESCRIPTION OF THE DRAWINGS 

A more complete understanding of the present invention 
may be had by reference to the following Detailed Descrip- 
tion when taken in conjunction with the accompanying 
drawings wherein: 

FIG. 1 is an illustration of a computer system embodying
the present invention; 

FIG. 2 is an illustration of an exemplary operating system 
embodying the present invention; 

FIG. 3 is a diagram of system memory in accordance with 
the present invention; 

FIGS. 4A and 4B are diagrams of a heap data structure, a 
free list, and a heap in accordance with the present inven- 
tion; 

FIG. 5 is a flow chart illustrating the allocation of memory 
to an application program; and 

FIG. 6 is a flow chart illustrating the deallocation of 
memory from an application program. 

DETAILED DESCRIPTION OF THE DRAWINGS 

The numerous innovative teachings of the present appli- 
cation will be described with particular reference to pres- 
ently preferred exemplary embodiments. However, it should 
be understood that this class of embodiments provides only 
a few examples of the many advantageous uses of the 
innovative teachings herein. In general, statements made in 
the specification of the present application do not necessarily 
delimit any of the various claimed inventions. Moreover, 
some statements may apply to some inventive features but 
not to others. 

Referring now to the drawings wherein like or similar 
elements are designated with identical reference numerals 
throughout the several views, and wherein the various 
elements depicted are not necessarily drawn to scale, and, in 
particular to FIG. 1, there is illustrated a schematic block
diagram of a computer system 100. As illustrated, computer
system 100 is a multiprocessor system and contains multiple 
host processors 110, 112, 114 and 116; system memory 119 
storing an operating system 118; and associated hardware 
130. As depicted, the associated hardware 130 includes 
items such as LAN controller 124, SCSI controller 126, an 
audio controller 128, and a graphics controller 132. 

As computer system 100 is a multiprocessing computer, it 
is able to execute multiple threads simultaneously, one for 
each of the processors therein. Further, it is contemplated 
that the computer system 100 can operate asymmetrically, 
symmetrically, or both symmetrically and asymmetrically. 

Referring now to FIG. 2, there is illustrated a more 
detailed block diagram of an exemplary operating system 
118. Applications 202 utilized in a computer system are kept 
separate from the operating system 118 itself. Operating 
system 118 runs in a privileged processor mode known as 
kernel-mode and has access to system data and hardware. 
Applications 202 run in a non-privileged processor mode 
known as user mode and have limited access to system data 
and hardware through a set of tightly controlled application
programming interfaces (APIs) 204. 

As depicted, the architecture of operating system 118 is a 
kernel based operating system. Operating system 118
includes subsystems 210 (which operate in user mode), and
system or executive services 212 (which operate in kernel
mode). Executive services 212 may typically comprise mul-
tiple components, such as the I/O manager 214, the object
manager 216, the security reference monitor 219, the pro-
cess manager 220, the local procedure call facility 222, the
virtual memory manager 224, the kernel 226, and the
hardware abstraction layer (HAL) 228. The components that
make up the executive services provide basic operating
system services to the subsystems 210 and to each other. The
components are generally completely independent of one
another and communicate through controlled interfaces. 

Still referring to FIG. 2, the I/O manager 214 manages all 
input and output for the operating system 118 including the 
managing of the communications between drivers of the 
computer system. Object manager 216 is for creating, 
managing, and deleting executive objects. Security refer- 
ence monitor 219 is utilized to ensure proper authorization 
before allowing access to system resources such as memory, 
I/O devices, files and directories. Process manager 220 
manages the creation and deletion of processes by providing 
a standard set of services for creating and using threads and 
processes in the context of a particular subsystem environ-
ment. Local procedure call facility 222 is a message-passing
mechanism for controlling communication between the cli- 
ent and server when they are on the same machine. Virtual 
memory manager 224 maps virtual addresses in the process' 
address space to physical pages in the computer's memory. 
With further reference to FIG. 2, kernel 226 is the core of
the architecture of operating system 118 and manages the
most basic of the operating system functions. It is respon- 
sible for thread dispatching, multiprocessor synchronization, 
and hardware exception handling. The hardware abstraction 
layer (HAL) 228 is an isolation layer of software that hides, 
or abstracts, hardware differences from higher layers of the 
operating systems. Because of the HAL 228, the different 
types of hardware 130 all look alike to the operating system 
118, removing the need to specifically tailor the operating 
system to the hardware 130 with which it communicates. 
Ideally, the HAL 228 provides routines that allow a single 
device driver to support the same device on all platforms. 
HAL routines are called from both the base operating system 
218, including the kernel 226, and from the device drivers. 
The HAL 228 enables device drivers to support a wide
variety of I/O architectures without having to be extensively
modified. The HAL 228 is also responsible for hiding the 
details of symmetric multiprocessing hardware from the rest 
of the operating system. 
An application 202 causes a processor 110, 112, 114 or
116 to allocate a portion of memory 119 (see FIG. 1) called
a heap by including an instruction, HeapCreate(n). When the
processor 110, 112, 114 or 116 executes the command
HeapCreate(n), a continuous number of bytes, 2^M, is set
aside, wherein M is the lowest integer for which 2^M
equals or exceeds n. For example, for HeapCreate(5000),
M=13, and a heap 302 containing 2^13 or 8192 bytes
is set aside.
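
By way of illustration only, the sizing rule for HeapCreate(n) can be
sketched in Python as follows. The function name heap_order is
hypothetical and is not part of the described system; the sketch merely
computes the lowest integer M for which 2^M equals or exceeds n.

```python
def heap_order(n: int) -> int:
    """Return the lowest integer M such that 2**M >= n.

    heap_order is a hypothetical helper name used only for this sketch.
    """
    if n < 1:
        raise ValueError("requested size must be at least one byte")
    # (n - 1).bit_length() is the number of bits needed to represent n - 1,
    # which is exactly the smallest M with 2**M >= n.
    return (n - 1).bit_length()


def heap_size(n: int) -> int:
    """Number of bytes set aside for a request of n bytes (2**M)."""
    return 1 << heap_order(n)
```

For the HeapCreate(5000) example, heap_order(5000) evaluates to 13 and
heap_size(5000) evaluates to 8192.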

Referring now to FIG. 3, a block diagram of the system
memory 119 is described. Execution of a HeapCreate(n)
command by a processor 110, 112, 114 or 116 causes a
portion of system memory 119, or a heap 302, to be created.
Associated with the heap 302 is a heap data structure 304
and a Free List 306. The heap 302 is a continuous portion of
system memory 119 that is available for assignment to
various different applications 202. Heap subportions 302A
can be created from the heap 302 and individually assigned
to various applications 202. The heap 302 is managed
according to an algorithm known in the art as the Binary
Buddy Algorithm. In accordance with the Binary Buddy
Algorithm, all heap subportions 302A comprise 2^N continuous
bytes of memory, wherein N is an integer less than M.

Referring now to FIGS. 4A and 4B, more detailed diagrams
of the heap data structure 304, the free list 306, and
the heap 302 of FIG. 3 are illustrated. Referring to FIG. 4A,
the heap data structure 304 comprises an array of up to M+1
bits 402. The bits 402 are numbered starting from M and are
sequentially descending until, at the minimum, zero. The
free list 306 includes multiple lists 407 for each of the M+1
bits 402 in the heap data structure 304.

When the heap 302 is first created, the M bit of the M+1 bits 402
in the heap data structure 304 is set to 1, while all of the
remaining bits of the M+1 bits 402 are set to zero. In the free list
306, the list 407 corresponding to the M bit contains an entry
408 which is the address of the first byte of the heap 302. The
initial settings of the heap data structure 304 and the free list
306 indicate that the heap 302 contains a region of continuous
unassigned memory, 2^M bytes in size, beginning at the
address contained in the entry 408 in the free list 306.

Those skilled in the art will appreciate that as heap
subregions 302A are assigned to various applications 202,
the heap 302 will contain regions of assigned memory
scattered throughout the heap 302. Therefore, the unassigned
memory in the heap 302 will be non-continuous.
Instead, the unassigned memory will comprise a number of
regions. Furthermore, because the heap contains 2^M bytes of
memory and each assigned heap subregion contains
2^N bytes, where N<M, the region of unassigned memory will
consist of a number of heap subregions 302A, each containing
a number of bytes equal to an integer power of 2.

Referring now to FIG. 4B, the free list 306 contains lists
407 of entries 408 of every starting address of unassigned
heap subregions 302A-302F.

The entries 408 are sorted according to the size of the
represented heap subregion 302A-302F, such that there is a
list of unassigned heap subregions 302A-302F, for each
integer power of 2, up to 2^M. Each bit 402 of the heap data
structure 304 corresponds to a list 407 in the free list 306 and is
set to 1 if the list contains at least one entry 408 containing an
address of at least one heap subregion 302A-302F.

Accordingly, an application program can take a heap
subregion 302A-302F of the heap 302 by including an
instruction HeapAlloc(X, heapID) where X is the number of
bytes required, and heapID is a pointer which will point to
the beginning address of the heap at the completion of the
instruction.

Referring now to FIG. 5, the process by which the
processor 110, 112, 114, or 116 (see FIG. 1) executes the
instruction HeapAlloc(X, heapID) is described. Referring to
FIG. 5, the processor 110, 112, 114, or 116 begins by
determining the lowest power, N, of 2 which equals or
exceeds X (step 501). For example, if X=1000, N=10 and
2^10=1024 bytes. The processor 110, 112, 114, or 116 can
then examine the N bit 402 of the heap data structure 304 to
determine if there is an appropriately sized heap subregion
302A-302F in the free list 306 (step 502). Where the N bit
is set, an entry 408 containing an address to a heap subregion
302A-302F from the list 407 corresponding to the N bit 402
(the N list) is removed from the N list 407 (step 503). After
removing the entry 408, the N list 407 is checked to see if the
list 407 has become empty (step 504). When the list has
become empty, the N bit 402 in the heap data structure 304
is set to zero. In either case, the address of the heap
subregion 302A-302F contained in the entry 408 is then
assigned to the application 202.

Still referring to FIG. 5, where the N bit 402 is 0, the
processor increments N (step 510) and begins examining the
bits 402 of the heap data structure 304 in ascending order.
The value of N is compared to M (step 512) and if N
exceeds M, then there is no heap subregion 302A-302F
in the heap 302 which can accommodate a demand for X
bytes of memory by the application 202. Accordingly, the
HeapAlloc(X, heapID) instruction will fail for the application 202
(step 514). As long as N does not exceed M, the N bit 402
of the heap data structure 304 is examined (step 516). If the
N bit 402 is not set to 1, N is incremented (step 510) and the
process is repeated. Once an N bit 402 is
found which is set to 1, an entry 408 is removed from the N
list 407, in the free list 306 (step 518). The N list 407 is now
examined to see if the N list 407 is empty (step 520). Where
the N list 407 is empty, the N bit 402 is set to zero (step 522).
In either case, the heap subregion 302A-302F referred to by
the entry 408 is divided into two heap subregions. It is noted
that a region of memory consisting of 2^N bytes is
divided into two regions of memory, each containing 2^(N-1)
bytes.

Still referring to FIG. 5, N is decremented (step 530). The
address of the heap subregion with the higher memory
address is entered into the N list 407 of the free list 306 (step
532) and the N bit 402 in the heap data structure 304 is set
to 1 (step 534). If the remaining half of the heap subregion
contains twice as much memory as is required by the
application, the remaining half of the heap subregion
can be further divided in half. On the other hand, if
the remaining half of the subregion does not contain more
than twice as much memory as is needed by the requesting
application, the remaining half of the subregion should be
assigned. Accordingly, a comparison is performed to see if
the remaining half of the heap subregion contains at least
twice as much memory as is required (step 536). If the
remaining half contains more than twice as much memory as
is required, the remaining half is further divided in half (step
540), and steps 532-540 are repeated until a heap subregion
is yielded which does not have twice as much memory as is
required. The subregion is then assigned to the application
(step 542).

When the application 202 is finished using an assigned
heap subregion 302A-302F, it can return the heap subregion
302A-302F to the heap 302 by including an instruction
HeapFree(heapID, X), where heapID is a pointer which
points to the starting address of the heap subregion
302A-302F to be returned, and X is the number of bytes in
the heap subregion 302A-302F.

Referring now to FIG. 6, the process by which the
processor 110, 112, 114, or 116 executes the instruction
HeapFree(heapID, X) is described. The processor 110, 112,
114, or 116 begins by determining an integer, N, such that
2^N=X (step 602). The processor 110, 112, 114, or 116 then
proceeds to determine whether the N bit 402 is set to 1 (step
604). Where the N bit 402 is set to zero, the processor 110,
112, 114, or 116 sets the N bit 402 to 1 (step 606) and places
the address pointed to by the pointer heapID in the N list 407
(step 608), thus completing the instruction. If the N bit 402
is set to 1 (at step 604), the processor 110, 112, 114, or 116
proceeds to examine the N list 407 (step 612). The processor
110, 112, 114, or 116 examines the entries 408 in the N list
407 to try to find what is known in the art as a "Binary
Buddy." When the heap 302 is first created, the heap 302
contains 2^M continuous bytes of unassigned memory.

As applications request assignment of memory, the heap 
302 is progressively partitioned in half, such as in steps 524 
and 540. The two partitions created in steps 524 or 540 are 
said to be Binary Buddies with respect to each other. In
accordance with the Binary Buddy Algorithm, the processor 
110, 112, 114, or 116 seeks, where possible, to reunite 
partitions created in steps 524 or 540. Determining whether 
a Binary Buddy exists in the N list 407 (step 614) can be 
done in a number of different ways. In one embodiment, the 
address of the Binary Buddy can be recorded in a predeter- 
mined address of each subregion 302A-302F at the time of 
partitioning in steps 524 and 540. In another embodiment, the
address of the Binary Buddy can be implicitly determined by 
examining the address of the heap subregion 302A-302F. 
For example, if a heap 302 containing 2^M bytes begins at an
address wherein the M least significant bits in the address are 
0, the address of the Binary Buddy for a heap subregion 
302A-302F can be determined by setting the N least sig- 
nificant bits of the address to zero and inverting the N+1 bit. 
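
As a non-limiting sketch of this second embodiment, and assuming as
stated that the heap begins at an address whose M least significant bits
are 0, the implicit computation can be written in Python as follows. The
function name buddy_address and its parameters are hypothetical and are
used only for illustration.

```python
def buddy_address(addr: int, n: int) -> int:
    """Address of the Binary Buddy of the 2**n-byte subregion at addr.

    Assumes the heap starts at an address whose M least significant
    bits are 0, so every 2**n-byte subregion is aligned to 2**n.
    """
    low_bits = (1 << n) - 1      # mask covering the N least significant bits
    aligned = addr & ~low_bits   # set the N least significant bits to zero
    return aligned ^ (1 << n)    # invert the (N+1)th bit, counting from 1
```

Under these assumptions, two 1024-byte subregions at addresses 0x2000
and 0x2400 are Binary Buddies of one another: buddy_address(0x2000, 10)
is 0x2400 and buddy_address(0x2400, 10) is 0x2000.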

If the Binary Buddy is not found, the N bit 402 in the heap 
data structure 304 is set to 1 (step 606) and the address 
referred to by heapID is placed in the N list 407 (step 608), 
completing execution of the HeapFree(heapID, X) instruc- 
tion. 

On the other hand, if a Binary Buddy is found (in step 
614), the entry 408 containing the address of the Binary 
Buddy is removed from the N list 407 (step 616). The N list 
407 is checked to see whether it is empty after removing the 
entry 408 containing the Binary Buddy (step 618). If the N 
list 407 is empty, the N bit 402 is set to zero (step 620). In
either case, the Binary Buddy and the heap subregion 
302A-302F referred to by heapID are combined. The 
address of the first byte of either heapID or the Binary
Buddy, whichever has the lower address, is used as the
starting address of the new heap subregion 302A-302F. The 
value of N is incremented (step 624) and the process (steps 
612-624) is repeated for the new heap subregion 
302A-302F, until the largest possible heap subregion 
302A-302F without an unassigned Binary Buddy is placed
in the free list 306. 
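
The allocation flow of FIG. 5 and the deallocation flow of FIG. 6 can be
sketched together in Python as follows. This is a simplified,
single-threaded model offered for illustration only: the class name
BuddyHeap and its method names are hypothetical, ordinary Python lists
stand in for the lists 407, and concurrent access by multiple
applications is not modeled. Binary Buddies are located with the
implicit address computation, applied to offsets from the heap base.

```python
from typing import Optional


class BuddyHeap:
    """Hypothetical single-threaded model of bits 402 and lists 407."""

    def __init__(self, m: int, base: int = 0):
        self.m = m
        self.base = base
        self.bits = [0] * (m + 1)                 # heap data structure 304
        self.lists = [[] for _ in range(m + 1)]   # free list 306
        self._put(m, base)                        # one free region of 2**m bytes

    def _put(self, n: int, addr: int) -> None:
        """Enter addr in the N list and set the N bit to 1."""
        self.lists[n].append(addr)
        self.bits[n] = 1

    def _take(self, n: int, addr: Optional[int] = None) -> int:
        """Remove an entry from the N list; clear the N bit if it empties."""
        if addr is None:
            addr = self.lists[n].pop()
        else:
            self.lists[n].remove(addr)
        if not self.lists[n]:
            self.bits[n] = 0
        return addr

    def alloc(self, x: int) -> Optional[int]:
        if x < 1:
            raise ValueError("x must be at least one byte")
        want = (x - 1).bit_length()               # step 501: lowest N with 2**N >= x
        n = want
        while n <= self.m and not self.bits[n]:   # steps 502 and 510-516
            n += 1
        if n > self.m:
            return None                           # step 514: the request fails
        addr = self._take(n)                      # steps 503-504 or 518-522
        while n > want:                           # remaining half still >= 2 * 2**want
            n -= 1                                # step 530
            self._put(n, addr + (1 << n))         # steps 532-534: free the higher half
        return addr                               # step 542

    def free(self, addr: int, x: int) -> None:
        n = x.bit_length() - 1                    # step 602: x is assumed to equal 2**n
        while n < self.m:
            buddy = self.base + ((addr - self.base) ^ (1 << n))
            if not self.bits[n] or buddy not in self.lists[n]:
                break                             # no unassigned Binary Buddy
            self._take(n, buddy)                  # steps 616-620
            addr = min(addr, buddy)               # lower address starts the merged region
            n += 1                                # step 624
        self._put(n, addr)                        # steps 606-608
```

For example, with a heap of 2^13 bytes at base address 0, two successive
requests for 1000 bytes return addresses 0 and 1024, and freeing both
1024-byte subregions reunites the Binary Buddies until only the M bit is
again set to 1.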

Based on the foregoing, those skilled in the art should 
now understand and appreciate that the invention provides 
an advantageous way to provide dynamic memory 
management, particularly in multiprocessing environments. 
Concurrent, non-blocking queues are used to list the
heap subregions which are available for assignment.
Accordingly, application programs can concurrently access 
the heap data structure and the free list instead of having to 
block other applications. Because applications can concur- 
rently access the heap data structure and the free list, the 
bottleneck associated with the allocation of memory in a
multiprocessing environment is substantially curtailed. 
Another benefit of enabling application programs to con-
currently access the heap data structure is that there is no
longer a need for the operating system to arbitrate contention 
between two application programs attempting to allocate 
memory. Accordingly, the performance delay incurred when
switching from the user mode of the application program to 
the kernel mode of the operating system is eliminated. 

As will be recognized by those skilled in the art, the 
innovative concepts described in the present application can 
be modified and varied over a wide range of applications.
Accordingly, the scope of the present invention should not 
be limited to any of the specific exemplary teachings 
discussed, but is only limited by the following claims. 

What is claimed is: 

1. In a computer system comprising an operating system, 
a plurality of application programs, and system memory, a 
method for allocating the system memory to the plurality of 
application programs, said method comprising the steps of: 

executing an executable instruction within a first appli-
cation program of a plurality of application programs,
such that the first application program accesses a heap 
data structure and a free list, wherein the heap data 
structure and the free list comprise a concurrent non- 
blocking queue; 

executing an executable instruction within a second appli- 
cation program of the plurality of application programs, 
such that the second application program accesses the 
heap data structure and the free list, the second appli- 
cation program accessing the heap data structure and 
the free list concurrently with the first application 
program to request a respective requested amount of
system memory for allocation to the first and second 
application programs; 

identifying, based on the heap data structure and the free
list, available portions of system memory, each of the 
available portions comprising at least the respective 
requested amount of system memory for each of the 
first and the second application programs; and 

allocating the respective requested amounts of system 
memory within the available portions to the first and 
the second application programs. 

2. In a computer system comprising an operating system, 
a first application program and a second application 
program, and system memory, a system for allocating said 
system memory to the first and second application programs 
comprising: 

a heap data structure for listing a size of at least one 
unallocated portion of the system memory; 

a free list for listing at least one address of the at least one 
unallocated portion of the system memory, wherein the 
heap data structure and the free list comprise a con- 
current non-blocking queue; and 

an executive instruction within the first and second appli- 
cation programs for accessing the free list and the heap 
data structure by the first and the second application 
programs, wherein the first application program 
accesses the free list and the heap data structure con- 
currently with the second application program to 
request a respective requested amount of the system 
memory for allocation to the first and second applica- 
tion programs. 



01/25/2004, EAST Version: 1.4.1 



